second term
Statistical Convergence of Spherical First Hitting Diffusion Models
Bienewald, Simon, Trottner, Lukas
Denoising diffusion models have evolved into a state-of-the-art method for tasks in various fields, such as denoising and generation of images, text generation, or generation of synthetic data for training other machine learning models. First hitting diffusion models (FHDM) are a particular class of denoising diffusion models with \textit{random} adaptive generation time, tailored to generate data on a known manifold. Building on the conditioning framework of Doob's $h$-transform, these models leverage the given information on the target data manifold to demonstrate strong performance across tasks while offering distinct features such as time-homogeneous dynamics of the generating process and a reduced average simulation time. Even though the theoretical investigation of standard forward-backward diffusion models has attracted much attention in the recent past, the statistical convergence properties of FHDMs are not yet understood. In this work, we show that, up to logarithmic factors, FHDMs achieve the minimax optimal convergence rate in total variation for spherically supported Sobolev smooth data distributions. In particular, this is the first statistical optimality result for denoising diffusion modelling with random generation time.
1c71cd4032da425409d8ada8727bad42-Supplemental-Conference.pdf
We can see that the error for the first term is mainly due to the sample approximation. We therefore refer to the first term as the Variance. We refer to the second term as the Bias. Our proof of convergence of the bias adapts the proof in [31, Theorem 6] and [11], and utilizes the fact that $C_{Y|X}$ is Hilbert-Schmidt to obtain a sharp rate.
A.1 Bounding the Bias
In this section, we establish the bound on the bias.
Supplementary material for Discrete Valued Neural Communication in Structured Architectures Enhances Generalization
In this appendix, as a complement to Theorems 1-2, we provide additional theorems, Theorems 3-4, which further illustrate the two advantages of the discretization process by considering an abstract model with the discretization bottleneck. For the advantage on the sensitivity, the error due to potential noise and perturbation without discretization -- the third term $\xi(w, r', M', d) > 0$ in Theorem 4 -- is shown to be minimized to zero with discretization in Theorem 3. See Appendix C.1 for a simple comparison between the bound of Theorem 3 and that of Theorem 4 when the metric spaces $(M, d)$ and $(M', d')$ are chosen to be Euclidean spaces. We now introduce the notation used in Theorems 3-4. Here, $\phi_w$ represents a deep neural network with weight parameters $w \in W \subseteq \mathbb{R}^D$, $q_e$ is the discretization process with the codebook $e \in E \subseteq \mathbb{R}^{L \times m}$, and $h_\theta$ represents a deep neural network with parameters $\theta \in \Theta \subseteq \mathbb{R}^\zeta$. Thus, the tuple of all learnable parameters is $(w, e, \theta)$.
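The sensitivity advantage can be illustrated with a toy discretization step. The sketch below is only an illustration under stated assumptions: it assumes the discretization process maps a vector to its nearest codebook entry in Euclidean distance (the text above does not specify the rule), and the function name `quantize`, the codebook values, and the input vectors are all hypothetical.

```python
import math

def quantize(z, codebook):
    """Return the codebook entry nearest to z in Euclidean distance.

    Assumption: nearest-neighbour lookup stands in for the discretization
    process; the source text does not define the rule explicitly.
    """
    return min(codebook, key=lambda e: math.dist(z, e))

# Hypothetical 2-D codebook with three entries (illustrative values only).
codebook = [(0.0, 0.0), (1.0, 1.0), (-1.0, 2.0)]

z = (0.9, 1.2)          # stand-in for an encoder output
z_noisy = (0.95, 1.1)   # the same output under a small perturbation

q_clean = quantize(z, codebook)
q_noisy = quantize(z_noisy, codebook)

# Both inputs fall in the same nearest-neighbour cell, so the perturbation
# is removed entirely by the discretization step.
print(q_clean, q_noisy, q_clean == q_noisy)
```

In this toy setting the perturbation has no effect after discretization as long as the perturbed vector stays in the same nearest-neighbour cell, which is the intuition behind a noise-dependent error term vanishing.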
Derivations of Formulas
We have omitted a number of complicated formulas in the main text to provide clear intuition and a concise proof sketch. We list all the mentioned formulas here for the reader's reference. We consider the case where $U = V = A$ and $\Sigma$ is symmetric and full-rank, and we use gradient flow. We can derive the dynamics of $S = AA^\top$ as $\dot{S} = (\Sigma - S)S + S(\Sigma - S)$, which is a quadratic ordinary differential equation that is hard to solve directly. For simplicity, define $X := S^{-1} - \Sigma^{-1}$. Then
$\dot{X} = -X\Sigma - \Sigma X$. (24)
Solving this equation, we have
And it is interesting to verify that S(t) + P(t) Σ by using the following lemma.
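The step leading to (24) can be checked by hand. The following is a sketch under assumptions, not the authors' derivation: it assumes the gradient-flow dynamics $\dot{S} = (\Sigma - S)S + S(\Sigma - S)$, the substitution $X := S^{-1} - \Sigma^{-1}$, and that $S(t)$ stays invertible along the flow.

```latex
\begin{align*}
\frac{d}{dt} S^{-1}
  &= -S^{-1}\dot{S}\,S^{-1}
   = -S^{-1}(\Sigma - S) - (\Sigma - S)S^{-1}
   = 2I - S^{-1}\Sigma - \Sigma S^{-1}, \\
-X\Sigma - \Sigma X
  &= -(S^{-1} - \Sigma^{-1})\Sigma - \Sigma(S^{-1} - \Sigma^{-1})
   = 2I - S^{-1}\Sigma - \Sigma S^{-1}.
\end{align*}
% Since \Sigma is constant, \dot{X} = \frac{d}{dt} S^{-1}, so both lines agree
% and \dot{X} = -X\Sigma - \Sigma X. This equation is linear in X, and
\begin{equation*}
X(t) = e^{-t\Sigma}\, X(0)\, e^{-t\Sigma}
\end{equation*}
% solves it, because \Sigma commutes with e^{-t\Sigma}.
```

Under these assumptions the quadratic equation for $S$ becomes a linear one for $X$, which is why the change of variables makes the dynamics tractable.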